Allocating and using file descriptors for an application executing on a plurality of nodes

ABSTRACT

A method for allocating and using file descriptors for an application executing over a plurality of nodes, each having a file system, includes receiving a system call from the application running on a first node to access a file in a file system, determining whether the file resides in a file system of the first node or a second node, and, upon determining that the file resides on the second node, sending the system call and arguments of the system call to the second node for execution on the second node and returning a result of the system call executed on the second node to the application on the first node.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of U.S. Provisional Application No. 63/164,955, filed on Mar. 23, 2021, which is incorporated by reference herein.

BACKGROUND

Data volume is increasing due to artificial intelligence (AI) and deep learning applications. This increase in data volume requires a commensurate increase in compute power. However, microprocessors cannot supply the needed compute power. Consequently, specialized architectures, such as accelerators and coprocessors, are taking over many of the compute tasks. These specialized architectures need to share access to large portions of system memory to achieve significant performance improvement.

Using specialized architectures creates new problems to be solved. Virtualizing specialized architectures is difficult, requiring high investment and strong vendor support because the architectures are usually proprietary.

One solution is intercepting the programming interfaces for the architecture, i.e., the application programming interfaces (APIs). In this solution, the intercepted APIs are sent to a node on which a particular specialized architecture (such as graphics processing units (GPUs) of a particular vendor) is installed and executed on that node. The execution relies on distributed shared memory (DSM) between central processing units (CPUs) and the GPUs. When tight memory coherence is needed between the CPUs and GPUs, remote procedure calls (RPCs) are used, which requires high traffic between nodes and highly detailed knowledge of the API semantics and the GPUs.

A better solution is needed, i.e., one that can handle specialized architectures of not just one but many different vendors on the same node without requiring specialized knowledge of the specialized architecture.

SUMMARY

One embodiment provides a method for allocating and using file descriptors for an application executing over a plurality of nodes, including a first node and a second node, each having a file system. The method includes executing a system call from the application running on the first node to access a file in a file system and determining whether the file resides in a file system of the first node or the second node. The method further includes, if the file resides on the second node, sending the system call and arguments of the system call to the second node for execution on the second node, receiving a result from the system call that is executed on the second node, and returning the result to the application on the first node.

Further embodiments include a device configured to carry out one or more aspects of the above method and a computer system configured to carry out one or more aspects of the above method.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 depicts an arrangement for accessing banks of GPUs in the prior art.

FIG. 2 depicts an arrangement for accessing banks of accelerators, according to an embodiment.

FIG. 3 depicts a representative system in which embodiments may operate.

FIG. 4A depicts a flow of operations for an initiator node setup, according to an embodiment.

FIG. 4B depicts a flow of operations for an acceptor node setup, according to an embodiment.

FIG. 4C depicts a flow of operations for loading an application, according to an embodiment.

FIG. 4D depicts a flow of operations for creating threads for an application, according to an embodiment.

FIG. 5A depicts a flow of operations for running the initiator node, according to an embodiment.

FIG. 5B depicts a flow of operations for running an acceptor node, according to an embodiment.

FIG. 6A depicts a flow of operations for implementing a system call on the initiator node, according to an embodiment.

FIG. 6B depicts a flow of operations for implementing a system call on the acceptor node, according to an embodiment.

FIG. 6C depicts a flow of operations for implementing a Detect Local function, according to an embodiment.

FIG. 7 depicts a flow of operations for loading a program file and a dynamic linker, according to an embodiment.

FIG. 8A depicts components in an initiator node and an acceptor node involved in setting up the initiator and acceptor nodes, according to an embodiment.

FIG. 8B depicts a flow of operations between initiator and acceptor nodes during address space synchronization, according to an embodiment.

FIG. 8C depicts a flow of operations between initiator and acceptor nodes during the creation of a coherent application, according to an embodiment.

FIG. 8D depicts a flow of operations between initiator and acceptor nodes during the establishment of runtimes, according to an embodiment.

FIG. 9 depicts a flow of operations for accessing a file, according to an embodiment.

DETAILED DESCRIPTION

In the embodiments, an application is co-executed among a plurality of nodes, where each node has installed thereon a plurality of specialized architecture coprocessors, including those for artificial intelligence (AI) and machine learning (ML) workloads. Such applications have their own runtimes, and these runtimes offer a way of capturing these workloads by virtualizing the runtimes. New architectures are easier to handle because of the virtualized runtime, and coherence among nodes is improved because the code for a specialized architecture runs locally to the specialized architecture. An application monitor is established on each of the nodes on which the application is co-executed. The application monitors maintain the needed coherence among the nodes to virtualize the runtime and engage semantic-aware hooks to reduce unnecessary synchronization in the maintenance of the coherence.

FIG. 1 depicts an arrangement for accessing banks of GPUs in the prior art. In the arrangement depicted, users 102 interact through a virtualized cluster of hosts 104, which is connected via a network 112 to nodes 106, 108, 110, containing a bank of GPUs of a particular vendor. Each node 106, 108, and 110 is a server with a hardware platform and an operating system. Each node is configured with the GPUs of the particular vendor. Compute nodes in virtualized cluster of hosts 104 send APIs, which are specific to the GPUs, to nodes 106, 108, 110 for execution on the GPUs.

FIG. 2 depicts an arrangement for accessing banks of accelerators, according to an embodiment. In the arrangement depicted, users 102 interact through a virtualized cluster of hosts 104, which is connected via a network 112 to nodes 206, 208, 210, where each node is a server-type architecture having a hardware platform, operating system, and possibly a virtualization layer. The hardware platform includes CPUs, RAM, network interface controllers, and storage controllers. The operating system may be a Linux® operating system or Windows® operating system. A virtualization layer may be present, and the above operating systems may operate above the virtualization layer. In addition, in the figure, each node contains banks of heterogeneous accelerators. That is, each node 206, 208, 210 can contain many different types of accelerators, including ones from different vendors. Compute nodes in virtualized cluster of hosts 104 send requests to nodes 206, 208, 210 to run portions of applications installed in the compute nodes on a runtime installed on nodes 206, 208, 210.

In an alternative embodiment, nodes 206, 208, 210 are nodes with large amounts of memory, and portions of a large database or other application are installed on the nodes 206, 208, 210 to run thereon, taking advantage of the nodes' large amounts of memory. Portions of the application are targeted for execution on nodes having large amounts of memory instead of specific accelerators.

Languages often used for programming the specialized architectures or accelerators include Python®. In the Python language, the source code is parsed and compiled to byte code, which is encapsulated in Python code objects. The code objects are then executed by a Python virtual machine that interprets the code objects. The Python virtual machine is a stack-oriented machine whose instructions are executed by a number of co-operating threads. The Python language is often supplemented with platforms or interfaces that provide a set of tools, libraries, and resources for easing the programming task. One such platform is TensorFlow®, in which the basic unit of computation is a computation graph. The computation graph includes nodes and edges, where each node represents an operation, and each edge describes a tensor that gets transferred between the nodes. The computation graph in TensorFlow is a static graph that can be optimized. Another such platform is PyTorch®, which is an open-source machine-learning library. PyTorch also employs computational graphs, but the graphs are dynamic instead of static. Because computation graphs provide a standardized representation of computation, they can become modules deployable for computation over a plurality of nodes.

In the embodiments, an application is co-executed among a plurality of nodes. To enable such co-execution, runtimes and application monitors are established in each of the nodes. The runtimes are virtual machines that run a compiled version of the code of the application, and the application monitors co-ordinate the activity of the runtimes on each of the nodes.

FIG. 3 depicts a representative system in which embodiments may operate. The system includes two nodes, an initiator node 206 that starts up the system and thereafter operates as a peer node, and one or more acceptor nodes 208 (only one of which is depicted). Initiator node 206 and acceptor node 208 each include a process container 302, 308 containing an application 314, a runtime 316, 338, an application monitor 318, 340, one or more threads of execution 320, 346, data pages 324, 348, and code pages 322, 350 for the threads. Process container 302, 308 runs in userspace. In one embodiment, process containers 302, 308 are Docker® containers, runtimes 316, 338 are Python virtual machines, application 314 is a Python program with libraries such as TensorFlow or PyTorch, and threads 320, 346 correspond to the threads of the Python virtual machine. Application monitor 340 on initiator node 206 includes a dynamic linker (DL) 344 and a configuration file 342 for configuring the participating nodes. In general, a dynamic linker is a part of an OS that loads and links libraries and other modules as needed by executable code while the code is being executed. Alternatively, the initiator node sets up an acceptor node to have an application monitor with a DL and configuration file, and the application program is loaded onto the acceptor node.

Each node 206, 208 further includes an operating system 304, 310 and a hardware platform 306, 312. Operating system 304, 310, such as the Linux® operating system or Windows® operating system, provides the services to run process containers 302, 308. In some embodiments, operating system 304, 310 runs on hardware platform 306, 312. In other embodiments, operating system 304, 310 is a guest operating system running on a virtual hardware platform of a virtual machine that is provisioned by a hypervisor from hardware platform 306, 312. In addition, operating system 304, 310 provides a file system 364, 366, which contains files and associated file descriptors, each of which is an integer identifying a file.

Hardware platform 306, 312 on the nodes respectively includes one or more CPUs 326, 352, system memory, e.g., random access memory (RAM) 328, 354, one or more network interface controllers (NICs) 330, 356, a storage controller 332, 358, and a bank of heterogeneous accelerators 334, 360. The nodes are interconnected by network 112, such as Ethernet®, InfiniBand, or Fibre Channel.

Before running an application over a plurality of nodes, the nodes are set up. Setup of initiator node 206 and acceptor node 208 includes establishing the application monitor and runtimes on each of the nodes on which libraries or other deployable modules are to run, the coherent memory spaces in which the application, libraries, or other deployable modules are located, and the initial thread of execution of each runtime. With the setup complete, the application monitors and runtimes in each node co-operate to execute the application among the plurality of nodes.

FIGS. 4A-4D depict a flow of operations for an initiator node 206 setup and an acceptor node 208 setup, according to an embodiment. Specifically, FIG. 4A depicts a flow of operations for an initiator node setup, according to an embodiment. FIG. 4B depicts a flow of operations for an acceptor node setup, according to an embodiment. FIG. 4C depicts a flow of operations for loading an application, according to an embodiment. FIG. 4D depicts a flow of operations for creating threads for an application, according to an embodiment.

Referring to FIG. 4A, on start-up, initiator node 206 establishes a connection to acceptor node 208 in step 402. In step 404, initiator node 206 establishes an application monitor and a runtime on initiator node 206 and sends a message requesting that acceptor node 208 establish an application monitor and runtime thereon. Initiator node 206 then performs a coherent load of an application binary (step 406, further described with reference to FIG. 4C). In step 408, initiator node 206 may load a library if needed. In step 412, further described with reference to FIG. 4D, a thread is started using this stack, with an entry point being the application's ‘main’ function.

Referring to FIG. 4B, on start-up, acceptor node 208 receives a message to establish application monitor 318 and runtime 316 in step 420. In step 422, acceptor node 208 receives the library or other deployable module from initiator node 206 and, in response, loads the received code for the library or other deployable module. In step 424, acceptor node 208 receives the request to create memory space from initiator node 206 and, in response, creates the memory space at the specified location. In step 426, acceptor node 208 receives a request to create the stack address space from initiator node 206 and, in response, creates and locates the requested stack address space. Acceptor node 208 then receives, in step 428, a command from initiator node 206 to form a dual (shadow) thread based on the execution thread in initiator node 206 and, in response, establishes the requested dual thread.

Referring to FIG. 4C, in step 432, initiator node 206 synchronizes address spaces. In step 434, initiator node 206 establishes a virtualization boundary. Establishing the boundary includes creating a sub-process (called VProcess below) that shares an address space with its parent process and can have its system calls traced by the parent. The parent process detects the sub-process interactions with the operating system and ensures that these interactions are made coherently with the other node or nodes. In step 436, initiator node 206 loads the application binary and an ELF (Executable and Linkable Format) interpreter binary into the address space inside the virtualization boundary. The parent process detects this address space manipulation through tracing and keeps the acceptor node coherent with changes made by the sub-process. In step 438, initiator node 206 populates an initial stack for the ELF interpreter binary inside the virtualization boundary, and in step 440, initiator node 206 starts executing the ELF interpreter binary on its own stack inside the virtualization boundary. Execution inside the virtualization boundary assures that address spaces and execution policies are coherent between the initiator and acceptor nodes and that any changes made by the runtime are intercepted so that consistency of the loaded application is maintained.
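The following is a minimal C sketch of how such a boundary might be created on a Linux host: a sub-process is created with clone(2) using CLONE_VM so that it shares the parent's address space, and it marks itself traceable before pausing. This is a sketch under stated assumptions, not the disclosed implementation; the names vprocess_entry, create_vprocess, and CHILD_STACK_SIZE are hypothetical.

    /* Hedged sketch: create a traceable sub-process (the VProcess) that
     * shares the parent's address space. Assumes Linux; names are
     * hypothetical. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define CHILD_STACK_SIZE (1024 * 1024)

    static int vprocess_entry(void *arg) {
        (void)arg;
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* let the parent trace us */
        raise(SIGSTOP);               /* pause until the monitor resumes us */
        /* ... the ELF interpreter binary would be entered here ... */
        return 0;
    }

    pid_t create_vprocess(void) {
        char *stack = malloc(CHILD_STACK_SIZE);
        /* CLONE_VM shares the address space; the child remains a separate
         * task whose system calls the parent can observe. */
        pid_t pid = clone(vprocess_entry, stack + CHILD_STACK_SIZE,
                          CLONE_VM | SIGCHLD, NULL);
        waitpid(pid, NULL, WUNTRACED); /* wait for the child's SIGSTOP */
        return pid;
    }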

Executing the ELF interpreter binary inside the virtualization boundary may entail loading a library on the initiator or acceptor node and possibly establishing a migration policy regarding the library (e.g., pinning the library to a node, e.g., the acceptor node). Additionally, the ELF interpreter binary may establish additional coherent memory spaces, including stack spaces needed by the application.

In an alternative embodiment, instead of loading the application binary on initiator 206 in step 434, initiator 206 sends to acceptor 208 a command which contains instructions about how to load the application binary, and acceptor 208 processes these instructions to load the application binary on itself.

Referring to FIG. 4D, coherent execution threads are established by starting an execution thread using the just-created stack in step 408. In step 484, a command to form a dual execution thread corresponding to an execution thread on the local node is sent to acceptor node 208. In step 486, the thread information is returned. The dual thread is paused or parked, awaiting a control transfer request from the local node. When execution moves from one node to another, the register state of the local thread is recorded and sent to the other node as the local thread is parked. The other node receives the register state and uses it to resume the parked dual thread. In this way, the previously active thread becomes the inactive thread, and the inactive thread becomes the currently active thread. The movement of the active thread is further described with respect to FIGS. 6A and 6B.
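As a minimal sketch of this park-and-transfer step, assuming an x86-64 Linux node, a hypothetical message layout, and an already-connected peer socket, the recorded register state might be shipped as follows:

    /* Hedged sketch: capture a parked thread's register state and send it
     * to the peer node so the dual thread can be resumed with it. The
     * migrate_msg layout is hypothetical. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/user.h>   /* struct user_regs_struct (x86-64) */
    #include <unistd.h>

    struct migrate_msg {
        uint64_t thread_id;           /* identifies the dual-thread pair */
        struct user_regs_struct regs; /* full register file of the thread */
    };

    void park_and_transfer(uint64_t tid, const struct user_regs_struct *regs,
                           int peer_socket) {
        struct migrate_msg msg = { .thread_id = tid };
        memcpy(&msg.regs, regs, sizeof msg.regs);
        write(peer_socket, &msg, sizeof msg); /* ship state to the peer */
        /* The local thread now waits, parked, for a control-transfer
         * request that carries the register state back. */
    }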

An MSI-coherence protocol applied to pages maintains coherence between memory spaces on the nodes so that the threads of the runtime are operable on any of the nodes. A modified (state ‘M’) memory page in one node is considered invalid (state ‘I’) in another. A shared (state ‘S’) memory page is considered read-only in both nodes. A code or data access to a memory page that is pinned to acceptor node 208 causes execution migration of the thread to acceptor node 208 followed by migration of the page; a data access to a memory page that is migratory triggers a migration of that memory page in a similar manner. In an alternate embodiment, upon a fault caused by an instruction accessing a code or data page on acceptor node 208, only the instruction is executed on the node having the code or data page, and the results of the instruction are transferred over the network to the acceptor node.
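A minimal sketch of the per-page bookkeeping this protocol implies is shown below; the page-table entry layout and the migration helpers are hypothetical, and only the M/S/I transitions described above are modeled.

    /* Hedged sketch of per-page MSI state handling. The extern helpers are
     * hypothetical placeholders for the node-to-node transfer machinery. */
    #include <stdbool.h>

    enum page_state { PAGE_MODIFIED, PAGE_SHARED, PAGE_INVALID };

    struct page_entry {
        void *addr;
        enum page_state state;
        bool pinned_to_acceptor; /* access migrates the thread to the page */
        bool migratory;          /* access migrates the page to the thread */
    };

    extern void migrate_thread_to_acceptor(void); /* hypothetical */
    extern void fetch_page_from_peer(void *addr); /* hypothetical */

    void on_page_fault(struct page_entry *page, bool is_write) {
        if (page->pinned_to_acceptor) {
            migrate_thread_to_acceptor(); /* execution follows the page */
            return;
        }
        /* An 'I' page, or a write to an 'S' (read-only) page, requires
         * fetching ownership from the node holding the 'M' copy. */
        if (page->state == PAGE_INVALID ||
            (is_write && page->state == PAGE_SHARED)) {
            fetch_page_from_peer(page->addr);
            page->state = is_write ? PAGE_MODIFIED : PAGE_SHARED;
        }
    }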

FIGS. 5A-5B describe interactions of running the application on the initiator and acceptor nodes after the setup according to FIGS. 4A-4D is completed. These interactions include, in the course of executing the application on the initiator node, executing a library or other deployable module on the acceptor node. Executing the library or other deployable module involves ‘faulting in’ the code pages for the library or other deployable module and the data pages of the stack or other memory space, and moving execution back to the initiator node.

FIG. 5A depicts a flow of operations for running the initiator node, according to an embodiment. In step 502, acceptor node 208 is optionally pre-provisioned with stack or memory pages anticipated for executing threads on acceptor node 208, as described below. In step 504, acceptor node 208 is optionally pre-provisioned with functions of the library or other deployable module anticipated for the code. In step 506, the state of the thread is set to running. In step 508, initiator node 206 executes application 314 using the now-running thread. In step 510, the thread determines whether the execution of a function of a library or other deployable module is needed. If not, then the thread continues execution of its workload. If execution of a library or module function is needed, then in step 512, a message is sent to acceptor node 208 to migrate the workload of the thread to acceptor node 208. In step 514, the state of the local thread is set to a parked state, which means that the thread is paused but runnable on behalf of a dual thread on acceptor node 208. In step 516, initiator node 206 awaits and receives a message to migrate the workload of the thread back to initiator node 206 after acceptor node 208 has finished executing the function of the library or other deployable module.
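A compact sketch of the decision in steps 510-516 follows; all helper functions are hypothetical stand-ins for the pin check and the migration messages.

    /* Hedged sketch of the initiator-side migration decision (FIG. 5A).
     * All extern helpers are hypothetical. */
    #include <stdbool.h>

    extern bool needs_pinned_module(const void *pc); /* step 510 check */
    extern void send_migrate_to_acceptor(void);      /* step 512 */
    extern void park_until_migrated_back(void);      /* steps 514-516 */

    void initiator_execute_step(const void *pc) {
        if (needs_pinned_module(pc)) {
            send_migrate_to_acceptor(); /* hand the workload to the acceptor */
            park_until_migrated_back(); /* parked until migrated back */
        }
        /* otherwise the thread simply continues its workload (step 508) */
    }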

Pre-provisioning of the memory pages or stack pages is performed using DWARF-type (debugging with attributed record formats) debugger data. When initiator node 206 takes a fault on entry to the acceptor-pinned function, it analyzes the DWARF data for the target function, determines that it takes a pointer argument, sends the memory starting at the pointer to acceptor node 208, and sends the current page of the stack to acceptor node 208. The DWARF debugger data contains the addresses and sizes of all functions that can be reached from this point in the call graph, allowing the code pages to be sent to acceptor node 208 prior to being brought in by demand-paging. In this way, acceptor node 208 can pre-provision the memory it needs to perform its function prior to resuming execution.

FIG. 5B depicts a flow of operations for running an acceptor node, according to an embodiment. In step 552, the state of the local thread is initially set to parked. In step 554, one of five events occurs on acceptor node 208. The events are ‘migrate to acceptor’, ‘module fault’, ‘stack fault’, ‘application code execution’, or ‘default’. The module fault and stack fault, though specifically described, are examples of a memory fault, which may include other types of memory faults, such as a heap fault and a code fault, not described. The different types of memory faults are handled in a similar manner.

If the event is ‘migrate to acceptor’, then the state of the local thread is set to running in step 556. Flow continues to step 574, which maintains the thread's current state, and to step 576, where acceptor node 208 determines whether the thread is terminated. If not, control continues to step 554 to await the next event, such as a ‘module fault’, a ‘stack fault’, or ‘application code execution’.

If the event is a ‘module fault’, e.g., a library fault, then the state of the thread is set to parked in step 558, and in step 560, acceptor node 208 requests and receives a code page of the library or other deployable module not yet paged in from initiator node 206. In step 562, acceptor node 208 sets the state of the local thread to running, and the flow continues with the local thread running through steps 574, 576, 554 to await the next event if the thread is not terminated.

If the event is a ‘stack fault’, then the thread's state is set to parked in step 564, and acceptor node 208 sends a request to initiator node 206 to receive a stack page not yet paged in. In step 568, the thread's state is set to running, and the flow continues through steps 574, 576, and 554 to await the next event, assuming no thread termination.

If the event is ‘application code execution’, then the state of the local thread is set to parked in step 570, and acceptor node 208 sends a ‘migrate control’ message to initiator node 206 in step 572. Flow continues through steps 574, 576, and 554 to await the next event.

If the event is ‘default’ (i.e., any other event), then the thread's state is maintained in step 574, and flow continues through steps 576 and 554 to await the next event.

If the thread terminates as determined in step 576, the stack is sent back to initiator node 206 in step 578, and flow continues at step 554, awaiting the next event. If no event occurs, then ‘default’ occurs, which loops via steps 574 and 554 to maintain the thread's current state.
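The event handling of FIG. 5B can be summarized as a loop over the five events. The sketch below mirrors the step numbers above; the event enumeration and the waiting and page-request helpers are hypothetical, not from the disclosure.

    /* Hedged sketch of the acceptor-side event loop (FIG. 5B). Event and
     * helper names are hypothetical. */
    #include <stdbool.h>

    enum acceptor_event {
        EV_MIGRATE_TO_ACCEPTOR, EV_MODULE_FAULT, EV_STACK_FAULT,
        EV_APP_CODE_EXECUTION, EV_DEFAULT
    };
    enum thread_state { PARKED, RUNNING };

    extern enum acceptor_event wait_for_event(void);     /* step 554 */
    extern void request_code_page_from_initiator(void);  /* step 560 */
    extern void request_stack_page_from_initiator(void); /* stack fault */
    extern void send_migrate_control_to_initiator(void); /* step 572 */
    extern bool thread_terminated(void);                 /* step 576 */
    extern void send_stack_to_initiator(void);           /* step 578 */

    void acceptor_run(void) {
        enum thread_state state = PARKED;                /* step 552 */
        for (;;) {
            switch (wait_for_event()) {
            case EV_MIGRATE_TO_ACCEPTOR:
                state = RUNNING;                         /* step 556 */
                break;
            case EV_MODULE_FAULT:
                state = PARKED;                          /* step 558 */
                request_code_page_from_initiator();      /* step 560 */
                state = RUNNING;                         /* step 562 */
                break;
            case EV_STACK_FAULT:
                state = PARKED;                          /* step 564 */
                request_stack_page_from_initiator();
                state = RUNNING;                         /* step 568 */
                break;
            case EV_APP_CODE_EXECUTION:
                state = PARKED;                          /* step 570 */
                send_migrate_control_to_initiator();     /* step 572 */
                break;
            case EV_DEFAULT:
                break;                   /* step 574: keep current state */
            }
            if (thread_terminated()) {                   /* step 576 */
                send_stack_to_initiator();               /* step 578 */
            }
        }
    }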

Often, in the course of execution of the application, operating system services are needed. The application, via the runtime on a particular node, makes system calls to the operating system to obtain these services. However, the particular node making the system call may not have the resources for executing the system call. In these cases, the execution of the system call is moved to a node having the resources. FIGS. 6A-6C depict the flow of operations to execute and possibly move execution of a system call. Specifically, FIG. 6A depicts a flow of operations for implementing a system call on the initiator node, according to an embodiment. FIG. 6B depicts a flow of operations for implementing a system call on the acceptor node, according to an embodiment. FIG. 6C depicts a flow of operations for implementing a Detect Local function, according to an embodiment.

Referring to FIG. 6A, in step 602, a thread running in the local node makes a system call. In step 604, the application monitor on the local node receives the system call via a program that is responsible for manipulating interactions with the virtualization boundary (called VpExit below). In step 606, the application monitor determines whether the arguments involve local or remote resources. In step 608, if the system call involves remote resources (‘No’ branch), then the running thread is parked, and in step 610, the application monitor sends the system call and its arguments to the application monitor on the remote node that is to handle the system call. In step 612, the application monitor on the local node awaits completion and results of the system call, and in step 614, the running thread receives the results of the system call (via VpExit) and is made active again. In step 608, if the system call involves only local resources (‘Yes’ branch), then the local node handles the system call in step 616.

Referring now to FIG. 6B, in step 632, the application monitor on the remote node receives the system call and its arguments. In step 634, the state of the parked thread is set to active (i.e., running), and the remote node handles the system call in step 636. In step 638, the results of the system call are returned to the thread that made the call, which provides the results to the application monitor in step 640, after which, in step 642, the state of the thread is set back to the parked state. In step 644, the application monitor sends the completion and results back to the local node.

Referring now to FIG. 6C, the flow of operations depicted in the figure occurs in response to executing step 606. In step 652, the function gets all of the system call arguments and, in step 654, determines, for system calls other than a file access, whether the arguments interact with a resource pinned on another node, which is either a different acceptor node or the initiator node. If so, then the function returns ‘True’ in step 656. Otherwise, the function returns ‘False’ in step 658. If the system call is a file access, then the flow executes step 655, which is further described with reference to FIG. 9.
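Combining FIGS. 6A and 6C, the routing decision might look like the following sketch, where the serialization, transport, and resource-pinning helpers are hypothetical and file_is_remote corresponds to the FIG. 9 test sketched later in this description.

    /* Hedged sketch of system-call routing (FIGS. 6A and 6C). The extern
     * helpers are hypothetical. */
    #include <stdbool.h>

    struct syscall_req { long nr; long args[6]; };

    extern bool args_touch_remote_resource(const struct syscall_req *r); /* step 654 */
    extern bool file_is_remote(int fd);                  /* step 655 / FIG. 9 */
    extern long forward_to_peer(const struct syscall_req *r);  /* steps 610-614 */
    extern long do_local_syscall(const struct syscall_req *r); /* step 616 */

    long route_syscall(const struct syscall_req *r, bool is_file_access) {
        bool remote = is_file_access
            ? file_is_remote((int)r->args[0])
            : args_touch_remote_resource(r);
        if (remote) {
            /* steps 608-614: park the thread, execute remotely, resume
             * with the result returned via VpExit */
            return forward_to_peer(r);
        }
        return do_local_syscall(r);
    }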

FIG. 7 depicts a flow of operations for loading a program file and a dynamic linker, according to an embodiment. The flow of operations of FIG. 7 describes in more detail the step of loading the application according to step 432 of FIG. 4C, where the loading is performed by the operating system, the application monitor, and the dynamic linker.

In step 702, application monitor 340 loads the ELF program file and gets a file system path for the ELF interpreter binary. In step 706, application monitor 340 prepares an initial stack frame for a binary of application program 314 (hereinafter referred to as “primary binary”). In step 708, application monitor 340 acquires the primary binary using the ELF interpreter and informs the binary of the initial stack frame. In step 708, application monitor 340 starts DL 344, which was loaded by operating system 310. In step 710, DL 344 runs, and in step 712, DL 344 relocates the primary binary and DL 344 to executable locations, which are locations in system memory from which code execution is allowed by the OS. In step 714, DL 344 loads the program dependencies (of the library or other deployable module) and alters the system call table to intercept all system calls made by the primary binary. Some system calls are allowed through unchanged, while others are altered when DL 344 interacts with operating system 310. In step 716, DL 344 causes the relocated primary binary of application program 314 to run at the executable location. As a result, both application program 314 and DL 344 run in userspace. Running in userspace allows loading of the library or other deployable module to be within the virtualization boundary.

DL 344 can replace certain function calls that go through the library or other deployable modules with customized versions to add functional augmentation based on known semantics. In allocating address space using ‘mmap’ or ‘sbrk’, DL 344 assures, via the application monitor, that threads see a consistent view of the address space, so execution of threads may migrate over the nodes. In addition, a ‘ptrace’ system call is used to track the execution of DL 344 to find how it interacts with operating system 310. Interactions are then rewritten so that they run coherently between initiator node 206 and acceptor node 208. Ultimately, all interactions with operating system 310 go through symbols defined by DL 344 or resolved through DL 344.
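As an illustration of the ‘mmap’ case, the sketch below shows a wrapper of the kind DL 344 might resolve in place of the C library's mmap so that the peer's address map is updated on every allocation. The notification helper is hypothetical; this is a sketch of the idea, not the disclosed implementation.

    /* Hedged sketch: an mmap wrapper installed by the dynamic linker to
     * keep address spaces coherent. notify_peer_address_map is
     * hypothetical. */
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    extern void notify_peer_address_map(void *addr, size_t len, int prot);

    void *coherent_mmap(void *addr, size_t len, int prot, int flags,
                        int fd, off_t off) {
        void *p = mmap(addr, len, prot, flags, fd, off);
        if (p != MAP_FAILED)
            notify_peer_address_map(p, len, prot); /* the 'update the
                address map' message of FIG. 8C, step 846 */
        return p;
    }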

FIGS. 8A-8D describe the components and operations in more detail during the setup of the initiator node and acceptor node corresponding to steps 404, 442, 446, 450, 464, 466 of FIGS. 4A-4D. Specifically, FIG. 8A depicts components in an initiator node and an acceptor node involved in setting up the initiator and acceptor nodes, according to an embodiment. FIG. 8B depicts a flow of operations between initiator and acceptor nodes during address space synchronization, according to an embodiment. FIG. 8C depicts a flow of operations between initiator and acceptor nodes during the creation of a coherent application, according to an embodiment. FIG. 8D depicts a flow of operations between initiator and acceptor nodes during the establishment of runtimes, according to an embodiment.

Referring to FIG. 8A, initiator node 206 includes a VProcess 802, a Runtime module 804, a Bootstrap module 806, and a VpExit module 808. Acceptor node 208 includes similar components 822, 824, 826, 828 as on initiator node 206, along with an Init module 830. VpExit modules 808 and 828 are responsible for manipulating VProcess 802 and 822 interactions across their respective virtualization boundaries.

Referring now to FIG. 8B, in step 832, the acceptor Init module 830 receives a ‘hello function’ designating the address space from initiator Runtime 804. In step 834, acceptor Init module 830 sends a ‘create VpExit’ message to acceptor Bootstrap module 826. In step 836, acceptor Init module 830 sends an acknowledgment regarding the address space message back to initiator node 206. At this point, a synchronized address space is established between initiator node 206 and acceptor node 208.

Referring to FIG. 8C, in step 838, initiator node 206 sends a ‘create VpExit’ message to initiator Bootstrap module 806. In step 840, initiator node 206 sends a ‘create’ message to VProcess 802 of initiator node 206, which receives a ‘load VpExit’ message in step 842 from initiator node 206. At this point, VProcess 802 is created outside of the Remote Procedure Call (RPC) layer, and the resources that VProcess 802 uses are virtualized. In step 844, VProcess 802 sends a ‘Mmap’ message to VpExit module 808 of initiator node 206, which sends a ‘mmap’ message in step 845 to initiator node 206 and an ‘update the address map’ message in step 846 to Bootstrap module 826 of acceptor node 208. In step 848, Bootstrap module 826 of acceptor node 208 sends an acknowledgment (‘ok’) back to initiator node 206, which relays the message in step 850 to VpExit module 808, which relays the message to VProcess 802 in step 852. At this point, the address map of the application on the initiator is made coherent with the acceptor node.

Referring to FIG. 8D, in step 854, initiator VProcess 802 sends a ‘VpExit(Enter, hook_page)’ message to VpExit module 808. In step 856, VpExit module 808 sends an ‘Enter(hook_page)’ message to initiator Bootstrap module 806. In step 858, initiator Bootstrap module 806 sends a ‘create(VpExit)’ message to initiator Runtime 804. In step 860, initiator Bootstrap module 806 sends a ‘bootstrap(Runtime, hook_page)’ message to acceptor Bootstrap module 826, which sends in step 862 an ‘install(VpExit, hook_page)’ message to acceptor Runtime module 824. In step 864, acceptor Runtime module 824 sends an ‘install(VpExit)’ message to acceptor VProcess 822. In step 866, acceptor Bootstrap module 826 sends a ‘Runtime’ message to initiator Bootstrap module 806, which returns in step 868 to VpExit module 808, which returns in step 870 to VProcess 802. At this point, initiator node 206 and acceptor node 208 have both created runtimes for VProcess 802 and VProcess 822, and the memory and address spaces for VProcess 802 and 822 are coherent.

During bootstrap, initiator node 206, in one embodiment, uses the system ‘ptrace’ facility to intercept system calls generated by the virtual process. The application monitor runs in the same address space as the virtual process, which means that the application monitor is in the same physical process as the virtual process. In one embodiment, Linux's clone(2) system call allows the virtual process to be traced. The virtual process issues SIGSTOP to itself, which pauses execution of the virtual process before allocating any virtual process resources. The application monitor attaches to the virtual process via ‘ptrace’, which allows it to continue execution (using SIGCONT) from the point at which the virtual process entered SIGSTOP. Using ‘ptrace’, the application monitor can intercept and manipulate any system calls issued by the virtual process to preserve the virtualization boundary. After bootstrap, VProcess interactions with the operating system are detected by the syscall intercept library.
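A minimal sketch of the monitor's side of this bootstrap handshake, assuming Linux ptrace semantics and omitting error handling, follows; it is illustrative only.

    /* Hedged sketch of the monitor side of the ptrace bootstrap: attach
     * to the self-stopped virtual process, resume it, and stop at every
     * system-call entry and exit. */
    #include <signal.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    void monitor_trace(pid_t vproc) {
        int status;
        ptrace(PTRACE_ATTACH, vproc, NULL, NULL); /* attach to the tracee */
        waitpid(vproc, &status, 0);               /* observe its SIGSTOP */
        for (;;) {
            /* Resume (the SIGCONT-like continuation mentioned above) and
             * run until the next system-call boundary. */
            ptrace(PTRACE_SYSCALL, vproc, NULL, NULL);
            waitpid(vproc, &status, 0);
            if (WIFEXITED(status))
                break;
            /* Here the monitor would read registers (PTRACE_GETREGS),
             * inspect the call, and rewrite or forward it as needed to
             * preserve the virtualization boundary. */
        }
    }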

FIG. 9 depicts a flow of operations for accessing a file, according to an embodiment. As mentioned above, a file system resides on each of the nodes. Access to one or more files in the file systems may be requested by the application during execution by making a system call. If the requested file resides on the node making the system call, the file is available locally. However, if the file resides on a different node (another acceptor node or the initiator node), the system call is remotely executed according to FIGS. 6A-6C. According to step 655 of FIG. 6C, the system call determines whether the arguments of the system call interact with a remote pinned resource, which is a file that is not local to the node receiving the system call. The steps of FIG. 9 depict the use of the file descriptor, which was returned during a previous system call in which the file was opened, to determine the node on which the system call is to be executed.

Referring to FIG. 9, in step 900, the flow tests the file descriptor against a criterion. In one embodiment, the criterion is whether the file descriptor obtained in step 654 of FIG. 6C (during an open(filename) or other system call which returns the file descriptor fd) is even or not. If the file descriptor is an even integer, as determined in step 902, initiator node 206 is determined to have the file in step 904 because only files with even fds can be stored on the initiator. If the current node is initiator node 206, as determined in step 910, then a ‘False’ value is returned in step 916. The ‘False’ value indicates that the system call arguments do not interact with a remote pinned resource, and the system call is handled locally. If the current node is acceptor node 208, as determined in step 912, then a ‘True’ value is returned in step 914. The ‘True’ value indicates that the system call arguments do interact with a remote pinned resource, and the system call is to be handled remotely.

If the file descriptor is an odd integer, then acceptor node 208 is determined to have the file in step 906 because only files with odd fds can be stored on the acceptor node, where an odd fd is one that is odd modulo the number of acceptor nodes (i.e., odd = fd mod #acceptors). If the current node is that acceptor node, a ‘False’ value is returned in step 916, indicating that the needed resource is local. Otherwise, a ‘True’ value is returned in step 914, indicating that the needed resource is remote.

In an alternative embodiment, the criterion is whether the file descriptor is less than a specified integer, say 512. If so, as determined in step 902, initiator node 206 is determined to have the file in step 904 because only files with fds less than 512 are stored on the initiator. If the current node is initiator node 206, as determined in step 910, then a ‘False’ value is returned in step 916. The ‘False’ value indicates that the system call arguments do not interact with a remote pinned resource, and the system call is handled locally. If the current node is acceptor node 208, as determined in step 912, then a ‘True’ value is returned in step 914. The ‘True’ value indicates that the system call arguments do interact with a remote pinned resource, and the system call is to be handled remotely.

If the file descriptor is greater than 512, then acceptor node 208 is determined to have the file in step 906 because only files with fds greater than 512 are stored on the acceptor node, and in step 916, a ‘False’ value is returned. Otherwise, a ‘True’ value is returned in step 914.
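Both criteria reduce to a simple predicate. The sketch below implements the two variants, assuming a hypothetical helper that reports whether the executing node is the initiator; the threshold variant uses fd < 512 for the initiator, since the text above leaves fd equal to 512 unspecified.

    /* Hedged sketch of the FIG. 9 file-descriptor tests. The node-identity
     * helper is hypothetical. */
    #include <stdbool.h>

    extern bool current_node_is_initiator(void); /* hypothetical */

    /* Parity criterion: even fds reside on the initiator, odd fds on an
     * acceptor (odd taken modulo the number of acceptor nodes). Returns
     * 'True' when the file is remote to the current node. */
    bool file_is_remote_parity(int fd) {
        bool on_initiator = (fd % 2 == 0);                  /* steps 902-906 */
        return on_initiator != current_node_is_initiator(); /* steps 910-916 */
    }

    /* Threshold criterion: fds below 512 reside on the initiator. */
    bool file_is_remote_threshold(int fd) {
        bool on_initiator = (fd < 512);
        return on_initiator != current_node_is_initiator();
    }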

Certain embodiments as described above involve a hardware abstraction layer on top of a host computer. The hardware abstraction layer allows multiple contexts to share the hardware resource. These contexts are isolated from each other, in one embodiment, each having at least a user application program running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the contexts. In the foregoing embodiments, virtual machines are used as an example for the contexts and hypervisors as an example for the hardware abstraction layer. As described above, each virtual machine includes a guest operating system in which at least one application program runs. It should be noted that these embodiments may also apply to other examples of contexts, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., www.docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers, each including an application program and its dependencies. Each OS-less container runs as an isolated process in userspace on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application program's view of the operating environments. By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers can share the same kernel, but each container can be constrained to use only a defined amount of resources such as CPU, memory, and I/O.

Certain embodiments may be implemented in a host computer without a hardware abstraction layer or an OS-less container. For example, certain embodiments may be implemented in a host computer running a Linux® or Windows® operating system.

The various embodiments described herein may be practiced with other computer system configurations, including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.

One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer-readable media. The term computer-readable medium refers to any data storage device that can store data which can thereafter be input to a computer system. Computer-readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer-readable medium include a hard drive, network-attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc) such as a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer-readable medium can also be distributed over a network-coupled computer system so that the computer-readable code is stored and executed in a distributed fashion.

Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation unless explicitly stated in the claims.

Plural instances may be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).

What is claimed is:
1. A method for allocating and using file descriptors for an application executing over a plurality of nodes, including a first node and a second node each having a file system, the method comprising: executing a system call from the application running on a first node to access a file in a file system; determining whether the file resides in a file system of the first node or the second node; upon determining that the file resides on the second node, sending the system call and arguments of the system call to the second node for execution on the second node, and returning a result to the application on the first node.

2. The method of claim 1, wherein a file descriptor is obtained by the application performing an open system call using the file name of the file to be accessed, and determining whether the file resides in a file system of the first node or one of the second nodes includes testing the file descriptor against a criterion.

3. The method of claim 1, wherein the system call from the application running on the first node is executed by a thread running on the first node, said method further comprising: setting the thread to a parked state when the system call and arguments are sent to the second node for execution.

4. The method of claim 3, further comprising: setting the thread to a running state when the result is returned to the application on the first node.

5. The method of claim 4, further comprising: if the file resides on the first node, handling the system call on the first node and returning the result to the application on the first node.

6. The method of claim 1, wherein only files with a file descriptor meeting the criterion are stored on the first node.

7. The method of claim 1, wherein only files with a file descriptor not meeting the criterion are stored on the second node.

8. A system for allocating and using file descriptors for an application executing over a plurality of nodes, the system comprising: a first node having a file system installed thereon; and a second node having a file system installed thereon, wherein the first node is configured to: in response to a system call to access a file made by the application running on the first node: determine whether the file resides in a file system of the first node or the second node; and upon determining that the file resides in the second node, send the system call and arguments thereof to the second node for execution on the second node, and return the result to the application on the first node.

9. The system of claim 8, wherein a file descriptor is obtained by the application performing an open system call using the file name of the file to be accessed, and determining whether the file resides in a file system on the first node or one of the second nodes includes testing the file descriptor against a criterion.

10. The system of claim 8, wherein the system call from the application running on the first node is executed by a thread running on the first node, and the first node is further configured to set the thread to a parked state when the system call and arguments are sent to the second node for execution.

11. The system of claim 10, wherein the first node is further configured to set the thread to a running state when the result is returned to the application on the first node.

12. The system of claim 8, wherein the first node is further configured to: if the file resides on the first node, handle the system call and return the result to the application.

13. The system of claim 8, wherein only files with a file descriptor meeting the criterion are stored on the first node.

14. The system of claim 8, wherein only files with a file descriptor not meeting the criterion are stored on the second node.

15. A non-transitory computer-readable medium comprising instructions, which when executed, carry out a method for allocating and using file descriptors for an application executing on a plurality of nodes including a first node and a number of second nodes, the method comprising: executing a system call from the application running on a first node to access a file in a file system; determining whether the file resides in a file system of the first node or the second node; upon determining that the file resides on the second node, sending the system call and arguments of the system call to the second node for execution on the second node and returning the result of the system call executed on the second node to the application on the first node.

16. The non-transitory computer-readable medium of claim 15, wherein a file descriptor is obtained by the application performing an open system call using the file name of the file to be accessed, and determining whether the file resides in a file system of the first node or one of the second nodes includes testing the file descriptor against a criterion.

17. The non-transitory computer-readable medium of claim 15, wherein the system call from the application running on the first node is executed by a thread running on the first node and said method further comprises: setting the thread to a parked state when the system call and arguments are sent to the second node for execution.

18. The non-transitory computer-readable medium of claim 17, wherein the method further comprises: setting the thread to a running state when the result is returned to the application on the first node.

19. The non-transitory computer-readable medium of claim 15, wherein only files with a file descriptor meeting the criterion are stored on the first node.

20. The non-transitory computer-readable medium of claim 15, wherein only files with a file descriptor not meeting the criterion are stored on the second node.